Abstract

In this paper we combine the advantages of a model using
global source sentence contexts, the Discriminative Word
Lexicon, and neural networks. By using deep neural networks
instead of the linear maximum entropy model in the
Discriminative Word Lexicon models, we are able to leverage
dependencies between different source words due to the
non-linearity. Furthermore, the models for different target
words can share parameters and therefore data sparsity
problems are effectively reduced.

By using this approach in a state-of-the-art translation
system, we can improve the performance by up to 0.5 BLEU
points for three different language pairs on the TED
translation task.

1. Introduction 

Since the first attempt at statistical machine translation (SMT)
12, the approach has drawn much interest in the research 
community and huge improvements in translation quality 
have been achieved. Still, there are plenty of problems in 
SMT which should be addressed. One is that the translation 
decision depends on a quite small context. 

In standard phrase-based statistical machine translation
(PBMT) 01, the two main components are the translation
and language models. The translation model is estimated by
counting phrase pairs, which are sequences of words
extracted from bilingual corpora. By using phrase segments
instead of words, PBMT can exploit some local source and
target contexts within those segments. But no context
information outside the phrase pairs is used. In an n-gram
language model, only a context of up to n target words is
considered.

Several directions have been proposed to leverage
information from wider contexts in the phrase-based SMT
framework. For example, the Discriminative Word Lexicon
(DWL) Q f4j exploits the occurrence of all the words in the
whole source sentence to predict the presence of words in the
target sentence. This wider context information is encoded
as features and employed in a discriminative framework:
a maximum entropy (MaxEnt) model is trained for
each target word.


While this model can improve the translation quality in 
different conditions, MaxEnt models are linear classifiers. 
On the other hand, hierarchical non-linear classifiers can 
model dependencies between different source words better 
since they perform some abstraction over the input. Hence, 
introducing non-linearity into the modeling of the lexical 
translation could improve the quality. Moreover, since many 
pairs of source and target words co-occur only rarely, a way 
of sharing information between the different classifiers could 
improve the modeling as well. 

In order to address these issues, we developed a discriminative
lexical model based on deep neural networks. Since
we train one neural network for all target words as a
multivariate binary classifier, the model can share information
between different target words. Furthermore, the probability
is no longer a linear combination of weights depending
on the surface source words. Thanks to the non-linearity, we
are now able to exploit semantic dependencies among source
words.

This paper is organized as follows. In Section 2 we review
previous work related to lexical translation methods
as well as translation modeling using neural networks.
Then we describe our approach, including the network
architecture and its training procedure, in Section 3.
Section 4 provides experimental results of our translation
systems for different language pairs using the proposed lexical
translation model. Finally, conclusions are drawn in
Section 5.

2. Related work 

Since the beginnings of SMT, several approaches to increase
the context used for lexical decisions have been presented.
Moving from word-based to phrase-based SMT [2, 5] was
a big step towards employing wider contexts in translation
systems. In PBMT, the lexical joint models allow
us to use local source and target contexts in the form of
phrases. Lately, advanced joint models have been proposed
to either enhance the joint probability model between source
and target sides or engage more suitable contexts.

The n-gram based approach [6] directly models the joint
probability of source and target sentences from the conditional
probability of a current n-gram pair given sequences
of previous bilingual n-grams. In 0, this idea is introduced
into the phrase-based MT approach. Thereby, parallel context
over phrase boundaries can be used during the translation.

Standard phrase-based or n-gram translation models are
basically built upon statistical principles such as Maximum
Entropy and smoothing techniques. Recently, joint models
have been learned using neural networks, with which non-linear
translation relationships and semantic generalization of words
can be captured (§1. Le et al. [9] follow the n-gram translation
direction but model the conditional probability of a target
word given the history of bilingual phrase pairs using a
neural network architecture. They then use their model in a
k-best rescorer instead of in their n-gram decoder. Devlin
et al. [10] add longer source contexts and revise the joint
formula so that it can be included in a decoder rather than a
k-best rescoring module. Schwenk et al. CD calculate the
conditional probability of a target phrase instead of a target
word given a source phrase.

Although the aforementioned works essentially augment
the joint translation model, they have an inherent limitation:
they only exploit local contexts. They estimate the joint model
using sequences of words as the basic unit. On the other hand,
there are several approaches utilizing global contexts.
Motivated by Bangalore et al. CD, Hasan et al. ITJft calculate the
probability of a target word given two source words which do
not necessarily belong to a phrase. Mauser et al. 0 suggest
another lexical translation approach, named Discriminative
Word Lexicon (DWL), concentrating on predicting the
presence of target words given the source words. Niehues
et al. 0 extend the model to employ the source and target
contexts, but they use the same MaxEnt classifier for the
task. Carpuat et al. OH) is the most similar work to the
DWL direction in terms of using the whole source sentence
to perform the lexical choices of target words. They treat the
selection process as a Word Sense Disambiguation (WSD)
task, where target words or phrases are WSD senses. They
extract a rich feature set from the source sentences, including
source words, and input them into a WSD classifier. Still,
the problem persists since they use shallow classifiers for
that task.

Considering the advantages of non-linear models mentioned
before, we opt for using deep neural network architectures
to learn the DWL. We take advantage of both
directions. On the one hand, our model uses a non-linear
classification method to leverage dependencies between different
source words as well as its semantic generalization ability.
On the other hand, by employing global contexts, our
model can complement joint translation models which use
local contexts.

3. Discriminative lexical translation using 
deep neural networks 

We will first review the original DWL approach described in
0 and 0. Afterwards, we will describe the neural network
architecture and training procedures proposed in this work.
We will finish this section by describing the integration into
the decoding process.

3.1. Original Discriminative Word Lexicon 

In this approach, the DWL is modeled using a maximum
entropy model to determine the probability of using a target
word in the translation. Therefore, individual models for
every target word are trained. Each model is trained to return
the probability of this word given the input sentence.

The input of the model is the source sentence; thus, they
need a way to represent the input sentence. This is done by
representing the sentence as a bag of words and thereby
ignoring the order of the words. In the MaxEnt model, they
use an indicator feature for every input word. More formally,
a given source sentence $s = s_1 \ldots s_I$ is represented by the
features $F(s) = \{f_w(s) : \forall w \in V_s\}$, where $V_s$ is the source
vocabulary:

\[
f_w(s) = \begin{cases} 1 & \text{if } w \in s \\ 0 & \text{if } w \notin s \end{cases} \qquad (1)
\]

The models are trained on examples generated from the
parallel training data. The labels for training the classifier of
target word $t_j$ are defined as follows:

\[
\mathrm{label}_{t_j}(s, t) = \begin{cases} 1 & \text{if } t_j \in t \\ 0 & \text{if } t_j \notin t \end{cases} \qquad (2)
\]

This model approximates the probability $p(t_j|s)$ of a target
word $t_j$ given the source sentence $s$. We will discuss our
alternative method using neural networks to estimate those
probabilities in the next section.

In 0, the source context is considered in a way that the
sentence is no longer represented by a bag of words, but by
a bag of n-grams. Using this representation, they could
integrate the order information of the words, but the dimension
of the input space is increased. We also adapt this extension
to our model by encoding the bigrams and trigrams as
ordinary words in the source vocabulary.
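
As an illustration of this input encoding, the following minimal sketch builds the indicator features for one sentence, treating bigrams and trigrams as additional vocabulary entries. The function names, the underscore used to join n-gram tokens, and the toy vocabulary are assumptions made for the example; they are not taken from the original implementation.

```python
def ngram_features(tokens, max_n=3):
    """Return the set of active indicator features for one source sentence."""
    features = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            # An n-gram is stored as a single pseudo-word, e.g. "the_cat".
            # Assumption: tokens themselves contain no underscore.
            features.add("_".join(tokens[i:i + n]))
    return features


def to_binary_vector(features, vocabulary):
    """Map active features onto a fixed vocabulary order (1 = present)."""
    return [1 if entry in features else 0 for entry in vocabulary]


source = "the cat sat".split()
active = ngram_features(source)

# Hypothetical vocabulary mixing words, bigrams and trigrams.
vocabulary = ["cat", "dog", "the", "the_cat", "cat_sat", "the_cat_sat", "sat_the"]
vector = to_binary_vector(active, vocabulary)
print(vector)
```

Because the n-grams are simply extra vocabulary entries, the rest of the model does not need to distinguish them from words; only the input dimension grows.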

After inducing the probability for every word $t_j$ given the
source sentence $s$, these probabilities are combined into the
probability of the whole target sentence $t = t_1 \ldots t_J$ given $s$
as described in Section 3.4.

3.2. General network architecture 

Having reviewed the original DWL in the last section, we
will now describe the neural network that replaces the
MaxEnt model for calculating the probabilities $p(t_j|s)$.

The input and output of our neural network-based DWL
are the source and target sentences from which we would like
to learn the lexical translation relationship. As in the original
DWL approach, we represent each source sentence $s$ as
a binary column vector $\mathbf{s} \in \{0,1\}^{|V_s|}$, with $V_s$ being the
considered vocabulary of the source corpus. If a source word $s_i$
appears in that sentence $s$, the value of the corresponding
index $i$ in $\mathbf{s}$ is 1, and 0 otherwise. Hence, the source sentence
representation is a sparse vector, depending on the
considered vocabulary $V_s$. The same representation scheme
is applied to the target sentence $t$ to get a sparse binary
column vector $\mathbf{t}$ with the considered target vocabulary $V_t$.

Figure 1: FFNN architecture for learning lexical translation
(input layer: $V_s$; output layer: $V_t$).

As Figure 1 depicts, our main neural network-based
DWL architecture for learning lexical translation is a
feed-forward neural network (FFNN) with three hidden layers.
The matrix $W^{(1)} \in \mathbb{R}^{|V_s| \times |H_1|}$ connects the input layer to
the first hidden layer. Two matrices $W^{(2)} \in \mathbb{R}^{|H_1| \times |H_2|}$ and
$W^{(3)} \in \mathbb{R}^{|H_2| \times |H_3|}$ encode the learned translation mapping
between two compact global feature spaces of the source and
target contexts. And the matrix $W^{(4)} \in \mathbb{R}^{|H_3| \times |V_t|}$ computes
the lexical translation output. $|H_1|$, $|H_2|$, and $|H_3|$ are the
numbers of units in the first, second and third hidden layers,
respectively. The lexical translation distribution of the words
in the target sentence $p(t_i|s)$ for a given source sentence $s$ is
computed by a forward pass:

\[
p(t_i|s) = \sigma_i\left(W^{(4)T} O^{(3)}\right)
\]

where:

\[
O_j^{(k)} = \sigma_j\left(W^{(k)T} O^{(k-1)}\right), \qquad k \in \{1, 2, 3\}
\]

\[
O^{(0)} = \mathbf{s} \quad \text{and} \quad O^{(4)} = p(t|s)
\]

and $\sigma_j$ is the sigmoid function $\sigma(x)$ applied to the $j$-th value
in a column vector:

\[
\sigma(x) = \frac{1}{1 + e^{-x}}
\]

So the parameters of the network are:

\[
\theta = \left(W^{(1)}, W^{(2)}, W^{(3)}, W^{(4)}\right)
\]

To investigate the impact of the network configuration,
we built a simpler architecture with only one hidden layer
featuring the translation relationship between source and
target sentences. We will refer to this as the SimNNDWL in the
comparison section later.

3.3. Network training

In neural network training, for each instance, which
comprises a sentence pair $(s, t)$, we push
the conditional probability $p_i = p_\theta(t_i|s)$ towards either 1
or 0, depending on the appearance of the corresponding word
$t_i$ in the target sentence $t$. The neural network operates as a
multivariate classifier which gives the probabilistic score for
a binary decision of independent variables, i.e. the
appearances of target words. Here we minimize the cross entropy
error function between the binary target sentence vector $\mathbf{t}$
and the output of the network $\mathbf{p} = [p_i]$:

\[
E = -\frac{1}{|V_t|} \sum_{i=1}^{|V_t|} \left( t_i \ln p_i + (1 - t_i) \ln(1 - p_i) \right)
\]

We train the network by back-propagating the error based
on the gradient descent principle. The error gradient for the
weights between the last hidden layer and the output is
calculated as (up to the constant factor $1/|V_t|$):

\[
\frac{\partial E}{\partial w_{ij}^{(4)}} = \left(O_j^{(4)} - t_j\right) O_i^{(3)}
\]

The error gradient for the weights between the other layers
is calculated based on the error gradients for activation
values from the previous layers:

\[
\frac{\partial E}{\partial w_{ij}^{(k)}} = \frac{\partial E}{\partial O_j^{(k)}} \, O_j^{(k)} \left(1 - O_j^{(k)}\right) O_i^{(k-1)}
\]

Then the weight matrices are batch-updated after each
epoch:

\[
W^{(k)}[T+1] = W^{(k)}[T] - \eta \sum_{i=1}^{N} \frac{\partial E}{\partial W^{(k)}}
\]

where:

• $N$ is the number of training instances.

• $\eta$ is the learning rate of the network.

• $W^{(k)}[T+1]$ is the weight matrix of the layer $k$ after
$T+1$ epochs of training.

3.4. Sentence-level lexical translation scoring

With the independence assumption among target words, the
target probabilities are combined to form the sentence-level
lexical translation score:

\[
p(t|s) = \prod_{t_j \in v_t} p(t_j|s) \qquad (3)
\]

where $v_t$ is the set of all target words appearing in the target
sentence $t$.

In Equation 3 we need to update the lexical translation
score only if a new word appears in the hypothesis. That
means we do not take into account the frequency of words
but multiply the probability of one word only once even if
the word occurs several times in the sentence. Other models
in our translation system, however, will restrict overusing a
particular word. Furthermore, to keep track of the words
whose probabilities have been calculated already, additional
bookkeeping would be required. In order to avoid those
difficulties, we come up with the following approximation, given
that $J$ is the length of the target sentence $t$:

\[
p(t|s) = \prod_{j=1}^{J} p(t_j|s) \qquad (4)
\]
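
The difference between Equation 3 and Equation 4 can be made concrete with a small sketch in log space. The word probabilities below are hypothetical values standing in for the network outputs $p(t_j|s)$, and the function names are chosen for this example only.

```python
import math


def score_unique(hypothesis, word_prob):
    """Equation 3: each distinct target word contributes exactly once."""
    return sum(math.log(word_prob[w]) for w in set(hypothesis))


def score_positions(hypothesis, word_prob):
    """Equation 4: every target position contributes, so repeats count again."""
    return sum(math.log(word_prob[w]) for w in hypothesis)


# Hypothetical p(t_j|s) values for one source sentence.
word_prob = {"the": 0.9, "cat": 0.5, "sat": 0.4}
hypothesis = ["the", "cat", "the", "sat"]

eq3 = math.exp(score_unique(hypothesis, word_prob))     # 0.9 * 0.5 * 0.4
eq4 = math.exp(score_positions(hypothesis, word_prob))  # 0.9 * 0.5 * 0.9 * 0.4
print(eq3, eq4)
```

The per-position form needs no record of which words were already scored, and its log is a plain sum over words, so it can be accumulated phrase by phrase while a hypothesis is extended.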

In order to speed up the calculation of the target word
probabilities, we pre-calculate all probabilities for a given
source sentence prior to translation. In a naive approach we
would need to pre-calculate the probabilities for all possible
target words given the source sentence. This would lead to
very slow calculations. Therefore, we first define the target
vocabulary of a source sentence as the vocabulary comprised
of the respective words from the phrase pairs matching the
source sentence. Using this definition, we only need to
pre-calculate the probabilities of all words in the target side of
the phrase table and not all target words in the whole corpus.
Moreover, we can calculate the score for every phrase pair even
before starting with the translation.
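
A minimal sketch of this restriction is given below. It assumes, purely for illustration, that the phrase table is a dictionary from source phrases to lists of target phrases and that word probabilities are already available as log values; neither the data structure nor the function names come from the decoder used in the paper.

```python
import math


def candidate_target_vocab(source_tokens, phrase_table):
    """Collect target words of all phrase pairs whose source side occurs in the sentence."""
    vocab = set()
    n = len(source_tokens)
    for start in range(n):
        for end in range(start + 1, n + 1):
            src_phrase = " ".join(source_tokens[start:end])
            for tgt_phrase in phrase_table.get(src_phrase, []):
                vocab.update(tgt_phrase.split())
    return vocab


def phrase_score(tgt_phrase, word_log_prob):
    """Log score of one target phrase: sum of its word log-probabilities (per-position product)."""
    return sum(word_log_prob[w] for w in tgt_phrase.split())


# Hypothetical toy phrase table: source phrase -> target phrases.
phrase_table = {"the cat": ["le chat"], "cat": ["chat", "chatte"], "dog": ["chien"]}
source = "the cat sleeps".split()

vocab = candidate_target_vocab(source, phrase_table)
# Placeholder probabilities; in the model these would be the network outputs.
word_log_prob = {w: math.log(0.5) for w in vocab}
le_chat = phrase_score("le chat", word_log_prob)
print(sorted(vocab), le_chat)
```

Only the words in `vocab` need a probability for this sentence, so "chien" is never scored here.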

4. Experiments 

In this section, we describe the translation system we use for 
the experiments, the configurations of the NNDWL and the 
results of those experiments. 

4.1. System description 

The system we use as our baseline is a state-of-the-art
translation system for English to French without any DWL. To the
baseline system, we add several DWL components trained
on different corpora as independent features in the log-linear
framework utilized by our in-house phrase-based decoder.

The system is trained on the EPPS, NC, Common Crawl and
Giga corpora and TED talks [15]. The monolingual data we
used to train language models includes the corresponding
monolingual parts of those parallel corpora plus News Shuffle
and Gigaword. The data is preprocessed and the phrase
table is built using the scripts from the Moses package ED.
We adapt the general, big corpora to the in-domain TED data
using the Backoff approach described in D3. Adaptation is
also conducted for the monolingual data. We train a 4-gram
language model using the SRILM toolkit [18]. In addition,
several non-word language models are included to capture
the dependencies between source and target words and
reduce the impact of data sparsity. We use a bilingual language
model as described in 0 as well as a cluster language model
based on word classes generated by the MKCLS algorithm
ED. Short-range reordering is performed as a preprocessing
step as described in [20].


Our in-house phrase-based decoder is used to search for
the best solutions among translation hypotheses, and the
optimization of the 13 to 17 features, depending on the settings
we use, is performed using Minimum Error Rate Training
ED. The weights are optimized and tested on two separate
sets of TED talks. The development set consists of 903
sentences containing 20k words. The test set consists of 1686
sentences containing 33k words.

We investigate the impact of our approach by employing
different configurations of the neural networks, described
in detail in the following section. We then evaluate those
configurations not only for English→French but also for
English→Chinese and German→English with similar
translation system setups.

Our NNDWL models are trained on a small subset of the
mentioned training corpora, mainly the TED data. Although
the TED corpus is quite small compared to the overall
training data, it is very important since it best matches the
test data. In order to speed up the process of testing different
configurations, we therefore train the NNDWL only on this
corpus, except for the comparison reported in Section 4.3.4.
The statistics of the training and validation data for the
NNDWL are shown in Table 1.



                           En-Fr     En-Zh     De-En
Training     Sent.         149991    140006    130654
             Tok. (avg.)   3.1m      3.3m      2.5m
Validation   Sent.         6153      8962      7430
             Tok. (avg.)   125k      211k      142k

Table 1: Statistics of the corpora used to train NNDWL


4.2. Network configurations 

In the main neural network architecture we propose, the
sizes of the hidden layers $|H_1|$, $|H_2|$, $|H_3|$ are 1000, 500,
1000, respectively. If we use the original source and target
vocabularies, for the English→French direction trained
on preprocessed TED 2013 data, $V_s$ includes 47957 words
and $V_t$ includes 62660 words. Because of the non-linear
calculations through such a large network, the training is
extremely time-consuming. In order to boost the efficiency, we
limit the source and target vocabularies to the most frequent
words. All words outside the lists are treated as unknown
words. We vary the size of the considered vocabularies over
the values {500, 1000, 2000, 5000} while keeping the sizes
of the hidden layers the same (i.e. 1000 × 500 × 1000). In
preliminary experiments, this layout led to the best
performance, so we use it for the remainder of the paper.

The same calculation problem occurs with the source
contexts, even more seriously due to the curse of
dimensionality. Hence, we applied the same cut-off scheme to the
source-side bigrams and trigrams, with the numbers of most
frequent bigrams and trigrams set to (200, 100), (500, 200) and
(1000, 500).
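
The cut-off scheme can be sketched as follows. This is a minimal illustration on a toy corpus; the `UNK` marker and the function names are assumptions for the example, and the same routine would be applied separately to words, bigrams and trigrams with their respective limits.

```python
from collections import Counter

UNK = "UNK"  # hypothetical name for the shared unknown-word entry


def build_vocab(sentences, size):
    """Keep the `size` most frequent entries, plus one entry for everything else."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [UNK] + [w for w, _ in counts.most_common(size)]


def map_unknown(sentence, vocab):
    """Replace every out-of-vocabulary token by the unknown-word entry."""
    known = set(vocab)
    return [tok if tok in known else UNK for tok in sentence]


corpus = [["a", "b", "a"], ["a", "c", "b"]]
vocab = build_vocab(corpus, size=2)
mapped = map_unknown(["a", "c", "d"], vocab)
print(vocab, mapped)
```

With a limit of 2000 words, for example, the input and output layers would each have 2001 units under this scheme, one of them reserved for unknown words.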


The simpler architecture SimNNDWL consisting of one 
1000-unit hidden layer is compared to the main architecture 
with the same setup. 

For training our proposed architecture, gradient descent
with a batch size of 15 and a learning rate of 0.02 is
used. Gradients are calculated by averaging across a
mini-batch of training instances and the process is performed for
35 epochs. After each epoch, the current neural network
model is evaluated on a separate validation set, and the model
with the best performance on this set is utilized for
calculating lexical translation scores afterwards. We regularize the
models with the L2 regularizer. As an alternative to L2,
we also experiment with the dropout technique [22], where
the neurons in the last hidden layer are randomly dropped out
with a probability of 0.4. However, it did not help, as
indicated by its performance in the translation system later. The
training is done on GPUs using the Theano toolkit [23].
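
To make the training procedure more tangible, the following pure-Python sketch runs the forward pass through four sigmoid layers, evaluates the cross-entropy error averaged over the target vocabulary, and applies gradient descent to the output-layer weights only. It is an illustration under simplifying assumptions, not the Theano implementation used in the experiments: the layer sizes are toy values, a single training instance is used instead of mini-batches of 15, and full back-propagation, L2 regularization and validation-based model selection are omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(n_in, n_out, rng):
    # W[i][j] connects unit i of the lower layer to unit j of the upper layer.
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]


def forward(weights, s):
    """Return the activations O^(0)..O^(4) for a binary source vector s."""
    activations = [s]
    for W in weights:
        below = activations[-1]
        layer = [sigmoid(sum(below[i] * W[i][j] for i in range(len(below))))
                 for j in range(len(W[0]))]
        activations.append(layer)
    return activations


def cross_entropy(p, t):
    """Cross-entropy between binary targets t and outputs p, averaged over outputs."""
    return -sum(tj * math.log(pj) + (1 - tj) * math.log(1 - pj)
                for pj, tj in zip(p, t)) / len(t)


def output_gradient(activations, t):
    """dE/dW4[i][j] = (O4_j - t_j) * O3_i / |V_t| for the averaged error."""
    o3, o4 = activations[-2], activations[-1]
    return [[(o4[j] - t[j]) * o3[i] / len(t) for j in range(len(t))]
            for i in range(len(o3))]


rng = random.Random(0)
sizes = [6, 4, 3, 4, 5]  # toy stand-in for |V_s|, 1000, 500, 1000, |V_t|
weights = [init_weights(a, b, rng) for a, b in zip(sizes, sizes[1:])]
s = [1, 0, 1, 0, 0, 1]   # binary source sentence vector
t = [0, 1, 0, 0, 1]      # binary target sentence vector
learning_rate = 0.02

before = cross_entropy(forward(weights, s)[-1], t)
for _ in range(200):
    acts = forward(weights, s)
    grad = output_gradient(acts, t)
    for i, row in enumerate(grad):
        for j, g in enumerate(row):
            weights[-1][i][j] -= learning_rate * g
after = cross_entropy(forward(weights, s)[-1], t)
print(before, after)
```

Updating only the last layer keeps the example short; the lower layers would receive their gradients through the chain rule in the same loop.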

4.3. Results 

Here we report the results using different NNDWL
configurations, mainly for an English→French translation system.
We also report the results using the best configurations for
other language pairs.

4.3.1. Experiments with different vocabulary sizes 

The results of the English→French translation system with
NNDWL models trained with different vocabulary sizes are
shown in Table 2.


System (En-Fr)    BLEU     ΔBLEU
Baseline          31.94    -
MaxEnt DWL        32.17    +0.23
NNDWL 500         32.06    +0.12
NNDWL 1000        32.37    +0.43
NNDWL 2000        32.38    +0.44
NNDWL 5000        32.07    +0.13
Full NNDWL        32.06    +0.12

Table 2: Results of the English→French NNDWL.

Varying the vocabulary sizes for both source and target
sentences not only helps to dramatically reduce neural
network training time but also affects the translation quality. In
our experiments, neural networks with 1000- and
2000-most-frequent-word vocabularies show the biggest improvements,
with around 0.44 BLEU points in translating from English to
French. They perform better than the DWL using the
maximum entropy approach and the NNDWL with the whole
source and target vocabularies.

While all NNDWL models achieve notable BLEU gains 
compared to the strong baseline, some of them are worse than 
the original MaxEnt model. It might be due to the fact that 
the original MaxEnt model uses the source contexts whereas 
the NNDWL models use just the source words.


4.3.2. The impact of n-gram source contexts 

Tables 3 and 4 show the impact of bigrams and trigrams
extracted from source sentences. We also vary the numbers of
the most frequent bigrams and trigrams.


System (En-Fr)             BLEU     ΔBLEU
Baseline                   31.94    -
NNDWL 2000                 32.38    +0.44
NNDWL 2000 SC-200-100      32.35    +0.41
NNDWL 2000 SC-500-200      32.44    +0.50
NNDWL 2000 SC-1000-500     32.36    +0.42

Table 3: Results of the 2000-NNDWL with source contexts.

For the NNDWL model with 2000-most-frequent-word
vocabularies, including source contexts helps in some cases
and does not harm the translation performance in the other
cases. With the 500 most frequent bigrams and 200 most
frequent trigrams, we achieve the best improvement of 0.5
BLEU points over the baseline.


System (En-Fr)             BLEU     ΔBLEU
Baseline                   31.94    -
NNDWL 1000                 32.37    +0.43
NNDWL 1000 SC-200-100      32.01    +0.07
NNDWL 1000 SC-500-200      32.23    +0.29
NNDWL 1000 SC-1000-500     32.39    +0.45

Table 4: Results of the 1000-NNDWL with source contexts.

The gains from adding source contexts to the
1000-vocabulary-size NNDWL model are not as clearly observed as
in the case of the 2000-vocabulary-size model. This might
indicate that the numbers of source contexts should be set
in proportion to the size of the vocabularies.

4.3.3. The impact of using different architectures 


System (En-Fr)     BLEU     ΔBLEU
Baseline           31.94    -
NNDWL 1000         32.37    +0.43
SimNNDWL 1000      32.12    +0.18
NNDWL 2000         32.38    +0.44
SimNNDWL 2000      32.29    +0.35
NNDWL 5000         32.07    +0.13
SimNNDWL 5000      31.71    -0.23

Table 5: Results of NNDWL and SimNNDWL architectures.

Here we compare our main architecture with the simpler
architecture SimNNDWL consisting of one 1000-unit hidden
layer. While the SimNNDWL trains faster (157 hours vs. 202
hours for training English→French with the whole
vocabularies), translation time is not significantly
affected. Since there are decreases in BLEU score using the
SimNNDWL architecture as shown in Table 5, the deep
architecture seems to have an advantage over the simple architecture.
Hence, we stick with our main architecture for the remaining
experiments.

4.3.4. The impact of data used to train NNDWL models 

We also train our NNDWL models on a bigger corpus
concatenating EPPS, NC and TED. The results in Table 6 show
that using a bigger corpus does not improve the translation
quality. The DWL models trained on in-domain data only,
i.e. TED, perform similarly to or better than the models trained
on more data from broader domains. This observation also
holds true for the original MaxEnt DWL models reported in

El- 


System (En-Fr)                  BLEU     ΔBLEU
Baseline                        31.94    -
NNDWL 1000 on TED               32.37    +0.43
NNDWL 1000 on EPPS+NC+TED       32.33    +0.39

Table 6: Results of the NNDWL trained on different corpora.

4.3.5. Other language pairs 

We conducted the experiments with NNDWL models mainly
on our English-to-French translation system in order to
investigate the impact of our method on a strong baseline.
However, we would also like to inspect the effect of the DWL on
language pairs with long-range dependencies or differences in
word order.

For that purpose, we built similar NNDWL models and
integrated them into our translation systems for other language
pairs. Tables 7 and 8 show the results for English→Chinese
and German→English, respectively.


English→Chinese

System (En-Zh)             BLEU     ΔBLEU
Baseline                   17.18    -
MaxEnt DWL                 16.78    -0.40
NNDWL 500                  17.09    -0.09
NNDWL 1000                 17.58    +0.40
NNDWL 1000 SC-200-100      17.63    +0.45
NNDWL 2000                 17.26    +0.08
NNDWL 2000 SC-200-100      17.20    +0.02

Table 7: Results of the English→Chinese NNDWL


In the case of the English→Chinese direction, the NNDWL
improves the translation quality, with a gain of
0.45 BLEU points over the baseline. That
best BLEU gain comes from the NNDWL with
1000-most-frequent-word vocabularies and the source contexts
containing 200 bigrams and 100 trigrams.


German→English

In the case of the German→English direction, the NNDWL
also helps to gain 0.34 BLEU points over the baseline with
the best model (i.e. 2000 most-frequent-word vocabularies
with source contexts). However, the improvement is
not notably different compared to the original MaxEnt DWL.


System (De-En)             BLEU     ΔBLEU
Baseline                   29.70    -
MaxEnt DWL                 29.95    +0.25
NNDWL 500                  29.82    +0.12
NNDWL 1000                 29.92    +0.22
NNDWL 2000                 29.95    +0.25
NNDWL 2000 SC-500-200      30.04    +0.34
NNDWL 5000                 29.89    +0.19

Table 8: Results of the German→English NNDWL


5. Conclusion 

In this paper we described a deep neural network approach
for DWL modeling and its integration into a standard
phrase-based translation system. Using neural networks as
a non-linear classifier for DWL makes it possible to learn
an abstract representation of global contexts and their
dependencies. We investigated various network configurations
on different language pairs. When we deployed our
best NNDWL model as a feature in our decoder, it improved
the translation quality by up to 0.5 BLEU points compared to
a very strong baseline.

Our NNDWL requires neither linguistic resources nor
feature engineering. Thus, it can easily be ported to new
languages. Furthermore, the probability calculation can be done
in a preprocessing step. Therefore, the new model does not
significantly slow down the translation process. Although we
do not use linguistic resources in our NNDWL, they can
be useful in modeling the translation probability for the
languages for which they are available. In future work we will
try to integrate linguistic features into the model. Moreover,
context vectors of words might be helpful in further reducing
the data sparseness problem.